Summary

First thing we want to do is to load data from A1 and A2, and install all required packages for this assignment.

A1

Assignment 1 was focused on cleaning data by removing outliers, mapping gene id to HUGO gene symbols, and normalizing data with TMM method. Gene set information are shown as below

description for dataset .

# First, we want to get the description for dataset GSE145412. 
gse <- getGEO("GSE145412",GSEMatrix=FALSE)
kable(data.frame(head(Meta(gse))), format = "html")
contact_address contact_city contact_country contact_email contact_institute contact_name
M. Sklodowskiej-Curie 24A Bialystok Poland Medical University of Bialystok Magdalena,,Paczkowska-Abdulsalam

information associated with the dataset.

# Then, we want to get information about the dataset
current_gpl <- names(GPLList(gse))[1]
current_gpl_info <- Meta(getGEO(current_gpl))
## File stored at:
## /tmp/RtmpJ1ickg/GPL18573.soft

Platform Title: Illumina NextSeq 500 (Homo sapiens)
Original submission date: Apr 15 2014
Last update date: Mar 26 2019
Organism: Homo sapiens
No. of GEO datasets that use this technology: 5133
No. of GEO samples that use this technology: 204776

normalized data are in the format below

A1_normalized = readRDS("normalized_data.rds")
kable(A1_normalized[1:5,1:5], type="html")
sample_1 sample_4 sample_5 sample_6 sample_7
7SK 1.8665112 0.4927562 2.3163723 0.6257193 0.4934407
A1BG-AS1 0.2201429 0.2269133 0.2236347 0.1269422 0.1814833
A2M-AS1 0.0419011 0.0575317 0.0302224 0.0658098 0.0579427
A3GALT2 0.0448311 0.0154599 0.0187952 0.0399286 0.0213537
AAAS 0.4379074 0.4786196 0.4105074 0.5047566 0.4477859

A2

The normalized data from Assignment #1 was ranked according to differential expression in Assignment 2. Then, we split up-regulated genes and down-regulated genes and store then into two files. After that, threshold over-representation analysis (g:Profiler) was performed on this data.and the top results are being analyzed.

The results are shown as below

Figure 1: differential expression genes overview

After that, we saved the rank list into a text file under data folder. Now we need to convert this text file to a rnk file in order to run GSEA.

Non-thresholded Gene set Enrichment Analysis

  1. I choose to use GSEA for the Non-thresholded Gene set Enrichment Analysis, The geneset I used was from the baderlab geneset collection from March 1, 2021 containing GO biological process, no IEA and pathways.

convert txt file to rnk file

temp_data <- read.delim("data/mets_ranked_genelist.txt", header = FALSE)

write.table(x=temp_data, file=file.path("data","mets_ranked_genelist.rnk"),sep = "\t", row.names = FALSE,col.names = FALSE,quote = FALSE)

kable(temp_data[1:5,1:2], type="html")
V1 V2
MYL9 -3.672864
PF4 -3.473816
NRGN -3.172615
S100A11 -2.940113
OAZ1 -2.769050

Now we have a rnk file, next step is to run GSEA program

GSEA setup

#path to GSEA jar 
gsea_jar <- params$gsea_jar
java_version <- params$java_version

#Gsea takes a long time to run.  If you have already run GSEA manually or previously there is no need to re-run GSEA.  Make sure the 
# gsea results are in the current directory and the notebook will be able to find them and use them.
run_gsea = params$run_gsea

#navigate to the directory where you put the downloaded protocol files.
working_dir <- params$working_dir

gsea_directory = params$gsea_directory

analysis_name <- params$analysis_name
rnk_file <- params$rnk_file
expression_file <- params$expression_file
classes_file <- params$classes_file

is_docker <- params$is_docker

Download the latest pathway definition file

gmt_url = "http://download.baderlab.org/EM_Genesets/current_release/Human/symbol/"

#list all the files on the server
filenames = getURL(gmt_url)
tc = textConnection(filenames)
contents = readLines(tc)
close(tc)

#get the gmt that has all the pathways and does not include terms inferred from electronic annotations(IEA)
#start with gmt file that has pathways only
rx = gregexpr("(?<=<a href=\")(.*.GOBP_AllPathways_no_GO_iea.*.)(.gmt)(?=\">)",
  contents, perl = TRUE)
gmt_file = unlist(regmatches(contents, rx))

dest_gmt_file <- file.path(working_dir,paste("Supplementary_Table3_",gmt_file,sep="") )

download.file(
    paste(gmt_url,gmt_file,sep=""),
    destfile=dest_gmt_file
)

Run GSEA

run GSEA with parameters. set maximum geneset size of 200, minimum geneset size of 15 and gene set permutation

if(run_gsea && java_version == "11"){
  command <- paste("",gsea_jar,  "GSEAPreRanked -gmx", dest_gmt_file, "-rnk" ,file.path(working_dir,rnk_file), "-collapse false -nperm 1000 -scoring_scheme weighted -rpt_label ",analysis_name,"  -plot_top_x 20 -rnd_seed 12345  -set_max 200 -set_min 15 -zip_report false -out" ,working_dir, " > gsea_output.txt",sep=" ")
  system(command)
} else if (run_gsea) {
  command <- paste("java  -Xmx1G -cp",gsea_jar,  "xtools.gsea.GseaPreranked -gmx", dest_gmt_file, "-rnk" ,file.path(working_dir,rnk_file), "-collapse false -nperm 1000 -permute gene_set -scoring_scheme weighted -rpt_label ",analysis_name,"  -num 100 -plot_top_x 20 -rnd_seed 12345  -set_max 200 -set_min 15 -zip_report false -out" ,working_dir, "-gui false > gsea_output.txt",sep=" ")
  system(command)
}

get output directory

if(gsea_directory == ""){
  gsea_directories <- list.files(path = working_dir, pattern = "\\.GseaPreranked")
  #get the details on the files
  details = file.info(file.path(getwd(),working_dir,gsea_directories))
  #order according to newest to oldest
  details = details[with(details, order(as.POSIXct(mtime),decreasing = TRUE)), ]
  #use the newest file:
  gsea_output_dir <- row.names(details)[1]
} else {
  gsea_output_dir <- gsea_directory
}

Result

All the result are stored in MetS.GseaPreranked folder under data directory. To obtain all the result for positive or negative. Below shows the top hits for postive and negative result.

gsea_dir <- file.path(getwd(),"data", gsea_directories[1])
#get the gsea result files  
gsea_results_files <- list.files(path = gsea_dir,
                                 pattern = "gsea_report_*.*.tsv")
#there should be 2 gsea results files  
enr_file1 <- read.table(file.path(gsea_dir, 
                        gsea_results_files[grepl("neg", gsea_results_files)]),
                        header = TRUE, sep = "\t", quote="\"",
                        stringsAsFactors = FALSE,row.names=1)  
enr_file2 <- read.table(file.path(gsea_dir, 
                        gsea_results_files[grepl("pos", gsea_results_files)]),
                        header = TRUE, sep = "\t", quote="\"",
                        stringsAsFactors = FALSE,row.names=1)

knitr::kable(enr_file1[1:20,c(1,4:8)], "html", 
             caption = "Top Down Regulated Enrichment Analysis Result")
Top Down Regulated Enrichment Analysis Result
GS.br..follow.link.to.MSigDB ES NES NOM.p.val FDR.q.val FWER.p.val
NEUTROPHIL MIGRATION%GOBP%GO:1990266 NEUTROPHIL MIGRATION%GOBP%GO:1990266 -0.8829091 -1.907374 0 0.0000000 0.000
NEUTROPHIL CHEMOTAXIS%GOBP%GO:0030593 NEUTROPHIL CHEMOTAXIS%GOBP%GO:0030593 -0.8897292 -1.895108 0 0.0000000 0.000
ANTIMICROBIAL HUMORAL IMMUNE RESPONSE MEDIATED BY ANTIMICROBIAL PEPTIDE%GOBP%GO:0061844 ANTIMICROBIAL HUMORAL IMMUNE RESPONSE MEDIATED BY ANTIMICROBIAL PEPTIDE%GOBP%GO:0061844 -0.8848908 -1.869164 0 0.0002544 0.001
GRANULOCYTE CHEMOTAXIS%GOBP%GO:0071621 GRANULOCYTE CHEMOTAXIS%GOBP%GO:0071621 -0.8671735 -1.857884 0 0.0008142 0.004
ANTIMICROBIAL PEPTIDES%REACTOME%R-HSA-6803157.2 ANTIMICROBIAL PEPTIDES%REACTOME%R-HSA-6803157.2 -0.8803539 -1.853803 0 0.0010194 0.006
GRANULOCYTE MIGRATION%GOBP%GO:0097530 GRANULOCYTE MIGRATION%GOBP%GO:0097530 -0.8572405 -1.848430 0 0.0013104 0.009
ANTIMICROBIAL HUMORAL RESPONSE%GOBP%GO:0019730 ANTIMICROBIAL HUMORAL RESPONSE%GOBP%GO:0019730 -0.8383114 -1.844298 0 0.0011466 0.009
MICROGLIA PATHOGEN PHAGOCYTOSIS PATHWAY%WIKIPATHWAYS_20210210%WP3937%HOMO SAPIENS MICROGLIA PATHOGEN PHAGOCYTOSIS PATHWAY%WIKIPATHWAYS_20210210%WP3937%HOMO SAPIENS -0.9002324 -1.833128 0 0.0018092 0.016
SELENIUM MICRONUTRIENT NETWORK%WIKIPATHWAYS_20210210%WP15%HOMO SAPIENS SELENIUM MICRONUTRIENT NETWORK%WIKIPATHWAYS_20210210%WP15%HOMO SAPIENS -0.8663569 -1.827882 0 0.0016283 0.016
CELL KILLING%GOBP%GO:0001906 CELL KILLING%GOBP%GO:0001906 -0.8421238 -1.824320 0 0.0015736 0.017
PLATELET DEGRANULATION%REACTOME DATABASE ID RELEASE 75%114608 PLATELET DEGRANULATION%REACTOME DATABASE ID RELEASE 75%114608 -0.8028120 -1.820281 0 0.0015278 0.018
RESPONSE TO ELEVATED PLATELET CYTOSOLIC CA2+%REACTOME%R-HSA-76005.2 RESPONSE TO ELEVATED PLATELET CYTOSOLIC CA2+%REACTOME%R-HSA-76005.2 -0.7953726 -1.816867 0 0.0016461 0.021
MYELOID LEUKOCYTE MIGRATION%GOBP%GO:0097529 MYELOID LEUKOCYTE MIGRATION%GOBP%GO:0097529 -0.8314753 -1.815126 0 0.0015285 0.021
TYROBP CAUSAL NETWORK%WIKIPATHWAYS_20210210%WP3945%HOMO SAPIENS TYROBP CAUSAL NETWORK%WIKIPATHWAYS_20210210%WP3945%HOMO SAPIENS -0.8229974 -1.813868 0 0.0014946 0.022
LEUKOCYTE CHEMOTAXIS%GOBP%GO:0030595 LEUKOCYTE CHEMOTAXIS%GOBP%GO:0030595 -0.8142357 -1.808093 0 0.0016569 0.026
COFACTOR CATABOLIC PROCESS%GOBP%GO:0051187 COFACTOR CATABOLIC PROCESS%GOBP%GO:0051187 -0.8381607 -1.803955 0 0.0019194 0.032
ROS AND RNS PRODUCTION IN PHAGOCYTES%REACTOME%R-HSA-1222556.9 ROS AND RNS PRODUCTION IN PHAGOCYTES%REACTOME%R-HSA-1222556.9 -0.8726843 -1.802897 0 0.0018694 0.033
CELL REDOX HOMEOSTASIS%GOBP%GO:0045454 CELL REDOX HOMEOSTASIS%GOBP%GO:0045454 -0.8497583 -1.801876 0 0.0018250 0.034
PLATELET DEGRANULATION%GOBP%GO:0002576 PLATELET DEGRANULATION%GOBP%GO:0002576 -0.7945729 -1.799563 0 0.0020400 0.040
DISRUPTION OF CELLS OF OTHER ORGANISM%GOBP%GO:0044364 DISRUPTION OF CELLS OF OTHER ORGANISM%GOBP%GO:0044364 -0.9143146 -1.797062 0 0.0022332 0.046
knitr::kable(enr_file2[1:20,c(1,4:8)], "html", 
             caption = "Top Up Regulated Enrichment Analysis Result")
Top Up Regulated Enrichment Analysis Result
GS.br..follow.link.to.MSigDB ES NES NOM.p.val FDR.q.val FWER.p.val
NEGATIVE REGULATION OF TYPE I INTERFERON-MEDIATED SIGNALING PATHWAY%GOBP%GO:0060339 NEGATIVE REGULATION OF TYPE I INTERFERON-MEDIATED SIGNALING PATHWAY%GOBP%GO:0060339 0.8149089 2.196550 0.0000000 0.0365328 0.042
NEGATIVE REGULATION OF UBIQUITIN-PROTEIN TRANSFERASE ACTIVITY%GOBP%GO:0051444 NEGATIVE REGULATION OF UBIQUITIN-PROTEIN TRANSFERASE ACTIVITY%GOBP%GO:0051444 0.7817173 2.108648 0.0000000 0.0759830 0.166
REGULATION OF DEFENSE RESPONSE TO VIRUS%GOBP%GO:0050688 REGULATION OF DEFENSE RESPONSE TO VIRUS%GOBP%GO:0050688 0.6180260 2.085647 0.0000000 0.0696607 0.223
THE HUMAN IMMUNE RESPONSE TO TUBERCULOSIS%WIKIPATHWAYS_20210210%WP4197%HOMO SAPIENS THE HUMAN IMMUNE RESPONSE TO TUBERCULOSIS%WIKIPATHWAYS_20210210%WP4197%HOMO SAPIENS 0.7641495 2.076919 0.0000000 0.0598717 0.249
POSITIVE REGULATION OF DEFENSE RESPONSE TO VIRUS BY HOST%GOBP%GO:0002230 POSITIVE REGULATION OF DEFENSE RESPONSE TO VIRUS BY HOST%GOBP%GO:0002230 0.7229613 2.067987 0.0000000 0.0532673 0.269
REGULATION OF DEFENSE RESPONSE TO VIRUS BY HOST%GOBP%GO:0050691 REGULATION OF DEFENSE RESPONSE TO VIRUS BY HOST%GOBP%GO:0050691 0.6435527 2.008794 0.0076923 0.0842204 0.449
REGULATION OF ALTERNATIVE MRNA SPLICING, VIA SPLICEOSOME%GOBP%GO:0000381 REGULATION OF ALTERNATIVE MRNA SPLICING, VIA SPLICEOSOME%GOBP%GO:0000381 0.6250964 2.000386 0.0000000 0.0801236 0.485
HALLMARK_INTERFERON_ALPHA_RESPONSE%MSIGDB_C2%HALLMARK_INTERFERON_ALPHA_RESPONSE HALLMARK_INTERFERON_ALPHA_RESPONSE%MSIGDB_C2%HALLMARK_INTERFERON_ALPHA_RESPONSE 0.5657241 1.965435 0.0000000 0.1069109 0.641
REGULATION OF UBIQUITIN PROTEIN LIGASE ACTIVITY%GOBP%GO:1904666 REGULATION OF UBIQUITIN PROTEIN LIGASE ACTIVITY%GOBP%GO:1904666 0.7040978 1.924069 0.0000000 0.1423176 0.792
TYPE I INTERFERON SIGNALING PATHWAY%GOBP%GO:0060337 TYPE I INTERFERON SIGNALING PATHWAY%GOBP%GO:0060337 0.5328547 1.883854 0.0000000 0.1884126 0.898
HALLMARK_INTERFERON_GAMMA_RESPONSE%MSIGDB_C2%HALLMARK_INTERFERON_GAMMA_RESPONSE HALLMARK_INTERFERON_GAMMA_RESPONSE%MSIGDB_C2%HALLMARK_INTERFERON_GAMMA_RESPONSE 0.5088842 1.877887 0.0000000 0.1793557 0.906
ANTIVIRAL MECHANISM BY IFN-STIMULATED GENES%REACTOME%R-HSA-1169410.4 ANTIVIRAL MECHANISM BY IFN-STIMULATED GENES%REACTOME%R-HSA-1169410.4 0.4983351 1.848086 0.0000000 0.2123345 0.954
RESPONSE TO TYPE I INTERFERON%GOBP%GO:0034340 RESPONSE TO TYPE I INTERFERON%GOBP%GO:0034340 0.5279398 1.842471 0.0000000 0.2066486 0.958
REGULATION OF TYPE I INTERFERON-MEDIATED SIGNALING PATHWAY%GOBP%GO:0060338 REGULATION OF TYPE I INTERFERON-MEDIATED SIGNALING PATHWAY%GOBP%GO:0060338 0.6174607 1.818549 0.0000000 0.2348676 0.984
MYOBLAST DIFFERENTIATION%GOBP%GO:0045445 MYOBLAST DIFFERENTIATION%GOBP%GO:0045445 0.6752113 1.814272 0.0143541 0.2268365 0.986
CELLULAR RESPONSE TO TYPE I INTERFERON%GOBP%GO:0071357 CELLULAR RESPONSE TO TYPE I INTERFERON%GOBP%GO:0071357 0.5328547 1.813603 0.0000000 0.2134491 0.986
HOST-PATHOGEN INTERACTION OF HUMAN CORONA VIRUSES - INTERFERON INDUCTION%WIKIPATHWAYS_20210210%WP4880%HOMO SAPIENS HOST-PATHOGEN INTERACTION OF HUMAN CORONA VIRUSES - INTERFERON INDUCTION%WIKIPATHWAYS_20210210%WP4880%HOMO SAPIENS 0.5833607 1.807796 0.0000000 0.2100040 0.988
TYPE I INTERFERON INDUCTION AND SIGNALING DURING SARS-COV-2 INFECTION%WIKIPATHWAYS_20210210%WP4868%HOMO SAPIENS TYPE I INTERFERON INDUCTION AND SIGNALING DURING SARS-COV-2 INFECTION%WIKIPATHWAYS_20210210%WP4868%HOMO SAPIENS 0.5841016 1.782053 0.0206897 0.2429643 0.994
CD4-POSITIVE, ALPHA-BETA T CELL DIFFERENTIATION%GOBP%GO:0043367 CD4-POSITIVE, ALPHA-BETA T CELL DIFFERENTIATION%GOBP%GO:0043367 0.6419525 1.759785 0.0228571 0.2701851 0.999
REGULATION OF NUCLEASE ACTIVITY%GOBP%GO:0032069 REGULATION OF NUCLEASE ACTIVITY%GOBP%GO:0032069 0.6240582 1.759063 0.0119048 0.2584494 0.999
  1. NEUTROPHIL MIGRATION was the top hit in na_negative(down-regulated) group with P-value: 0.00, ES: -0.88, NES: -1.91, FDR: 0.00. There are 38 genes in its leading edge. PF4 is the top gene associated with this geneset NEGATIVE REGULATION OF TYPE I INTERFERON-MEDIATED SIGNALING PATHWAY was the top hit in na_positive(up-regulated) group with P-value: 0.00, ES: 0.81, NES: 2.20, FDR: 0.037. There are 15 genes in its leading edge. OAS3 is the top gene associated with this geneset

Compare to A2

Reload the result from A2 Figure 2.1: down-regulated result from A2 Figure 2.2: up regulated result from A2

  1. When compare the result with A2, we can see the result didn’t change too much. both of them are related to neutrophil activation. For the up-regulated group, the tophits are different. But since the normal group did not have lots of data, so we can not make any conclusion here. Over all, the result from GSEA and the result from g:Profiler are consistent.

Visualize Gene set Enrichment Analysis in Cytoscape

setup Cytoscape

I decide to use Cytoscape locally. Since the Cytoscape can not recognize “—”, so I opened pos and neg result file and mannually changed “—” to 1. The app we use was EnrichmentMap in Cytoscape

There are 803 nodes and 4982 edges in the result. The parameter I used was FDR set to 0.1, node cutoff(p-value) set to 1.0, edge cutoff set to 0.375(default). The blue node shows the down-regulated group, and the red node shows the up-regulated group. The initial network was shown below.

Figure 3: Initial result

Annotate network

For annotation, I used the AutoAnnotate app in Cytoscape. All parameter are set to default. Cluster algorithm is MCL Cluster, the edge weight column set to similarity_coefficient, Max words per label was 3, adjacent word bonus was set to 8. below shows all the parameters for annotation.

Figure 4: Annotation setting

publication ready figure

The publication ready figure was shown as below

Figure 5: Annotation setting

Theme network

From the graph, there are two large cluster which are antigen proteasome degradation (60 nodes) and glycolytic process biosynthetic(67 nodes). Most of the node are blue, which means the gene was down-regulated.

After create summarized network, the major themes present is antigen proteasome degradation, this related to immunity responds(Rivett,A.J, 2004). This can related to the result from GSEA and g:Profiler. The novel pathway would be glycolytic process biosynthetic since this pathway was not really mentioned in either GSEA and g:Profiler results.

A summarized network was shown below, cluster layout was set to CoSE

Figure 6: Summarized network

Interpretation

  1. The enrichment result not really close to the original paper. It is more close to the g:profiler result from thresholded methods. The difference are in the up-regulated group, but since this group only has a small number of data(only one red dot on the above graph), so the result change can not tell us too much.

  2. There is evidence from Andersen et al.(2016), the article stated that MetS and obesity could cause inflammation and trigger immunu responds, this evidence directly supports the finding. This support that The top result for down-regulated genes are related to immune response, this could associated with MetS and obesity.

Dark matter analysis

capture Dark matter

gmt_file <- file.path("data", "Supplementary_Table3_Human_GOBP_AllPathways_no_GO_iea_March_01_2021_symbol.gmt")

capture.output(genesets <- GSA::GSA.read.gmt(gmt_file), file = "gsa_loud.out")

# give proper indices/names
names(genesets$genesets) <- genesets$geneset.names

expression <- readRDS("normalized_data.rds")

# Load genes from enriched pathway 
all_enr_genesets <- c(rownames(enr_file1), rownames(enr_file2)) 
genes_enr_gs <- c()

for(i in 1:length(all_enr_genesets)){
  current_geneset <- unlist(genesets$genesets[which(genesets$geneset.names %in% all_enr_genesets[i])])
  genes_enr_gs <- union(genes_enr_gs, current_geneset)
}

ranks <- read.table(file.path(getwd(), "data", "ranked_list.tsv"),
                    header = TRUE, sep = "\t", quote = "\"",
                    stringsAsFactors = FALSE)

Calculate dark matter

FDR_threshold <- 0.001
#get the genes from the set of enriched pathwasy (no matter what threshold)
all_sig_enr_genesets<- c(rownames(enr_file1)[which(enr_file1[,"FDR.q.val"]<=FDR_threshold)], rownames(enr_file2)[which(enr_file2[,"FDR.q.val"]<=FDR_threshold)])
genes_sig_enr_gs <- c()
for(i in 1:length(all_sig_enr_genesets)){
  current_geneset <- unlist(genesets$genesets[which(genesets$geneset.names %in% all_sig_enr_genesets[i])]) 
  genes_sig_enr_gs <- union(genes_sig_enr_gs, current_geneset)
}

genes_all_gs <- unique(unlist(genesets$genesets))

Venn Diagram of Dark Matter

A <- genes_all_gs
B <- genes_enr_gs
C <- rownames(expression)

venn_diagram_path <- file.path(getwd(),"image","dark_matter_overlaps.png")
png(venn_diagram_path)
VennDiagram::draw.triple.venn( area1=length(A), area2=length(B), area3 = length(C),
                  n12 = length(intersect(A,B)), n13=length(intersect(A,C)),
                  n23 = length(intersect(B,C)), 
                  n123 = length(intersect(A,intersect(B,C))),
                  category = c("all genesets","all enrichment results","expression"),
                  fill = c("red","green","blue"),
                  cat.col = c("red","green","blue")
)
dev.off()

Figure 7: Venn Diagram

Get gene set

# significant genes that are not annotated to any pathways in entire set of pathways used for the analysis
genes_no_annotation <- setdiff(C, A)

ranked_gene_no_annotation <- ranks[which(ranks[,1] %in% genes_no_annotation),]
knitr::kable(ranked_gene_no_annotation[1:10,], "html", caption = "Ranked gene without annotation")
Ranked gene without annotation
NAME TITLE SCORE X
3 MTCO3P12 NA 1.5206945 NA
23 IGHJ4 NA 0.8735703 NA
24 RP11-290D2.6 NA 0.8424625 NA
31 IGKJ5 NA 0.6906748 NA
34 EVI2A NA 0.6600435 NA
43 TRAJ7 NA 0.5888515 NA
44 IGHJ6 NA 0.5831147 NA
45 TRDD3 NA 0.5829422 NA
49 IGKJ4 NA 0.5590599 NA
52 IGHJ3P NA 0.5412626 NA
# significant genes that are not annotated to any of the pathways returned in the enrichment analysis
genes_no_annotation_enr <- setdiff(C, B)

ranked_gene_no_annotation_enr <- ranks[which(ranks[,1] %in% genes_no_annotation_enr),]
knitr::kable(ranked_gene_no_annotation[1:10,], "html", caption = "Ranked gene without annotation enrichment analysis")
Ranked gene without annotation enrichment analysis
NAME TITLE SCORE X
3 MTCO3P12 NA 1.5206945 NA
23 IGHJ4 NA 0.8735703 NA
24 RP11-290D2.6 NA 0.8424625 NA
31 IGKJ5 NA 0.6906748 NA
34 EVI2A NA 0.6600435 NA
43 TRAJ7 NA 0.5888515 NA
44 IGHJ6 NA 0.5831147 NA
45 TRDD3 NA 0.5829422 NA
49 IGKJ4 NA 0.5590599 NA
52 IGHJ3P NA 0.5412626 NA

Visualize Dark Matter with Heatmaps

we will use the same ComplexHeatmap and circlize R packages as in A2. 9,10,11

First, I will show the heatmap of any significant genes that are not annotated to any of the pathways returned in the enrichment analysis.

#get genes
enrich_heatmap_matrix <- expression[rownames(expression) %in% ranked_gene_no_annotation_enr$NAME,]

#normalize
enrich_heatmap_matrix <- t(scale(t(enrich_heatmap_matrix)))

if(min(enrich_heatmap_matrix) == 0){
  heatmap_col = colorRamp2(c( 0, max(enrich_heatmap_matrix)),
                           c( "white", "red"))
 } else {
   heatmap_col = colorRamp2(c(min(enrich_heatmap_matrix), 0,
                              max(enrich_heatmap_matrix)),
                            c("blue", "white", "red"))
 }
enrich_darkmatter_heatmap <- Heatmap(as.matrix(enrich_heatmap_matrix),
                           cluster_rows = TRUE,
                           cluster_columns = TRUE,
                           show_row_dend = TRUE,
                           show_column_dend = TRUE,
                           col=heatmap_col,
                           show_column_names = TRUE,
                           show_row_names = FALSE,
                           show_heatmap_legend = TRUE,
 )
enrich_darkmatter_heatmap

Then, I will show the heatmap of any significant genes that are not annotated to any pathways in entire set of pathways used for the analysis

#get genes
all_heatmap_matrix <- expression[rownames(expression) %in% ranked_gene_no_annotation$NAME,]

#normalize
all_heatmap_matrix <- t(scale(t(all_heatmap_matrix)))

if(min(all_heatmap_matrix) == 0){
  heatmap_col = colorRamp2(c( 0, max(all_heatmap_matrix)),
                           c( "white", "red"))
 } else {
   heatmap_col = colorRamp2(c(min(all_heatmap_matrix), 0,
                              max(all_heatmap_matrix)),
                            c("blue", "white", "red"))
 }
all_darkmatter_heatmap <- Heatmap(as.matrix(all_heatmap_matrix),
                           cluster_rows = TRUE,
                           cluster_columns = TRUE,
                           show_row_dend = TRUE,
                           show_column_dend = TRUE,
                           col=heatmap_col,
                           show_column_names = TRUE,
                           show_row_names = FALSE,
                           show_heatmap_legend = TRUE,
 )
all_darkmatter_heatmap

References

Andersen, Catherine J, Kelsey E Murphy, and Maria Luz Fernandez. 2016. “Impact of Obesity and Metabolic Syndrome on Immunity.” Advances in Nutrition 7 (1): 66–75. https://doi.org/10.3945/an.115.010207.

Chen, Hanbo. 2018. VennDiagram: Generate High-Resolution Venn and Euler Plots.

Davis, Sean, and Paul Meltzer. 2007. “GEOquery: A Bridge Between the Gene Expression Omnibus (Geo) and Bioconductor.” Bioinformatics 14: 1846–7.

Efron, Brad, and R. Tibshirani. 2019. GSA: Gene Set Analysis. http://www-stat.stanford.edu/~tibs/GSA.

Gu, Zuguang, Roland Eils, and Matthias Schlesner. 2016. “Complex Heatmaps Reveal Patterns and Correlations in Multidimensional Genomic Data.” Bioinformatics.

Gu, Zuguang, Lei Gu, Roland Eils, Matthias Schlesner, and Benedikt Brors. 2014. “Circlize Implements and Enhances Circular Visualization in R.” Bioinformatics 30 (19): 2811–2.

Jassal, Bijay, Lisa Matthews, Guilherme Viteri, Chuqiao Gong, Pascual Lorente, Antonio Fabregat, Konstantinos Sidiropoulos, et al. 2019. “The Reactome Pathway Knowledgebase.” Nucleic Acids Research. https://doi.org/10.1093/nar/gkz1031.

Kanehisa, Minoru. 2019. “Toward Understanding the Origin and Evolution of Cellular Organisms.” Protein Science 28 (11): 1947–51. https://doi.org/10.1002/pro.3715.

Kanehisa, Minoru, Miho Furumichi, Yoko Sato, Mari Ishiguro-Watanabe, and Mao Tanabe. 2020. “KEGG: Integrating Viruses and Cellular Organisms.” Nucleic Acids Research 49 (D1). https://doi.org/10.1093/nar/gkaa970.

McCarthy, Davis J., Yunshun Chen, and Gordon K. Smyth. 2012. “Differential Expression Analysis of Multifactor Rna-Seq Experiments with Respect to Biological Variation.” Nucleic Acids Research 40 (10): 4288–97. https://doi.org/10.1093/nar/gks042.

Mi, Huaiyu, Anushya Muruganujan, Dustin Ebert, Xiaosong Huang, and Paul D Thomas. 2018. “PANTHER Version 14: More Genomes, a New Panther Go-Slim and Improvements in Enrichment Analysis Tools.” Nucleic Acids Research 47 (D1). https://doi.org/10.1093/nar/gky1038.

Mootha, Vamsi K, Cecilia M Lindgren, Karl-Fredrik Eriksson, Aravind Subramanian, Smita Sihag, Joseph Lehar, Pere Puigserver, et al. 2003. “PGC-1α-Responsive Genes Involved in Oxidative Phosphorylation Are Coordinately Downregulated in Human Diabetes.” Nature Genetics 34 (3): 267–73. https://doi.org/10.1038/ng1180.

Morgan, Martin. 2019. BiocManager: Access the Bioconductor Project Package Repository. https://CRAN.R-project.org/package=BiocManager.

Müller, Kirill, Hadley Wickham, David A. James, and Seth Falcon. 2020. RSQLite: ’SQLite’ Interface for R. https://CRAN.R-project.org/package=RSQLite.

Paczkowska-Abdulsalam, Magdalena, Magdalena Niemira, Agnieszka Bielska, Anna Szałkowska, Beata Anna Raczkowska, Sini Junttila, Attila Gyenesei, et al. 2020. “Evaluation of Transcriptomic Regulations Behind Metabolic Syndrome in Obese and Lean Subjects.” International Journal of Molecular Sciences 21 (4): 1455. https://doi.org/10.3390/ijms21041455.

Reimand, Jüri, Meelis Kull, Hedi Peterson, Jaanus Hansen, and Jaak Vilo. 2007. “G:Profiler—a Web-Based Toolset for Functional Profiling of Gene Lists from Large-Scale Experiments.” Nucleic Acids Research 35. https://doi.org/10.1093/nar/gkm226.

Ritchie, Matthew E, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47. https://doi.org/10.1093/nar/gkv007.

Rivett, A. J., and Arron R. Hearn. 2004. “Proteasome Function in Antigen Presentation: Immunoproteasome Complexes, Peptide Production, and Interactions with Viral Proteins.” Current Protein and Peptide Science 5 (3): 153–61. https://doi.org/10.2174/1389203043379774.

Robinson, Mark D, Davis J McCarthy, and Gordon K Smyth. 2010. “EdgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40. https://doi.org/10.1093/bioinformatics/btp616.

Shannon, P. 2003. “Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks.” Genome Research 13 (11): 2498–2504. https://doi.org/10.1101/gr.1239303.

Slenter, Denise N, Martina Kutmon, Kristina Hanspers, Anders Riutta, Jacob Windsor, Nuno Nunes, Jonathan Mélius, et al. 2017. “WikiPathways: A Multifaceted Pathway Database Bridging Metabolomics to Other Omics Research.” Nucleic Acids Research 46 (D1). https://doi.org/10.1093/nar/gkx1064.

Subramanian, A., P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences 102 (43): 15545–50. https://doi.org/10.1073/pnas.0506580102.

Temple Lang, Duncan. 2020. RCurl: General Network (Http/Ftp/...) Client Interface for R. https://CRAN.R-project.org/package=RCurl.

Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.

Wickham, Hadley, Romain Fran?ois, Lionel Henry, and Kirill Müller. 2020. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

Xie, Yihui. 2020. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.

Zhu, Yuelin, Sean Davis, Robert Stephens, Paul S. Meltzer, and Yidong Chen. 2008. “GEOmetadb: Powerful Alternative Search Engine for the Gene Expression Omnibus.” Bioinformatics (Oxford, England) 24 (23): 2798–2800. https://doi.org/10.1093/bioinformatics/btn520.